Skip to content

fix(eval): correct tool-call capture, task-adherence scoring, and Foundry --remote push#431

Merged
james-tn merged 2 commits into
mainfrom
fix/eval-foundry-remote-client
Jun 10, 2026
Merged

fix(eval): correct tool-call capture, task-adherence scoring, and Foundry --remote push#431
james-tn merged 2 commits into
mainfrom
fix/eval-foundry-remote-client

Conversation

@james-tn

Copy link
Copy Markdown
Contributor

Summary

Fixes three evaluation-correctness defects discovered while running the MLADS agent-evaluation lab locally. Together they made two metrics — tool_call_accuracy and task_adherence — report misleadingly low scores that reflected measurement bugs, not real agent quality. Also includes a fix that makes Foundry --remote push work again on the current SDK.

Changes

1. Foundry --remote push (azure-ai-projects 2.1.0)

get_openai_client(default_query={"api-version": ...}) breaks on azure-ai-projects >= 2.1.0: the new unified /v1 Foundry path rejects the legacy api-version query parameter (400: api-version query parameter is not allowed when using /v1 path), so every --remote run failed to push results. Now calls get_openai_client() plainly with a fallback for older SDKs.

2. Streaming tool-call capture (agent-framework ≥ 1.7)

Every function_call delta chunk now carries the tool name + a stable call_id, but the code assumed only the first chunk had a name. Each argument fragment ({", customer, _id, …) was finalized as its own malformed tool call, turning one get_customer_detail({"customer_id":5}) into 6+ garbage calls with {"_raw": ...} args (and spamming duplicate tool_called UI events). track_function_call_start() now de-duplicates on call_id.

Before: 6–42 malformed fragment calls per query → After: 2–4 clean calls, 0 malformed.

3. Tool results captured + fed to the task-adherence judge

  • function_result output (keyed by call_id) was discarded; the mixin now records it via track_function_result().
  • evaluate_task_adherence passed tool calls but no tool results, so grounded answers looked fabricated (judge: "no corroborating tool interactions"). It now emits role:tool result messages so the judge can verify grounding.
  • azure-ai-evaluation 1.14.0 made TaskAdherenceEvaluator a binary "flagged" grader (score ∈ {0,1} + pass/fail result). The old code treated it as 1–5 and thresholded >= 3, so every PASS was recorded as a fail and displayed as 1.0/5. Scoring now honors the pass/fail result and maps the binary verdict to the 1–5 display.

Verification

5-case handoff run (local):

  • Tool calls captured cleanly (2–4/query, 0 malformed vs 6–42 before).
  • task_adherence average 0.2 → 3.4; the remaining low scores are genuine hallucinations (agent citing payment data absent from tool output) — exactly what the metric should catch.
  • --remote push verified working (portal link generated).

Scope / risk

Touches the shared ToolCallTrackingMixin and the single/handoff/reflection agents' streaming loops. Changes are strict improvements: fewer duplicate UI events and correct argument parsing. The Autogen serialization path (backend.py) is unaffected (Autogen delivers complete arguments, not deltas). No existing unit tests cover these paths; validated via live runs.

James N. and others added 2 commits June 9, 2026 19:25
get_openai_client(default_query={'api-version': ...}) breaks on
azure-ai-projects >= 2.1.0: the new unified /v1 Foundry path rejects the
legacy api-version query parameter (400: 'api-version query parameter is
not allowed when using /v1 path'), so every --remote run failed to push
results to Azure AI Foundry.

Call get_openai_client() without forcing api-version, falling back to an
explicit api_version kwarg only for older SDKs that require it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three related defects made tool_call_accuracy (~2.0) and task_adherence
(~0.2) misleadingly low, independent of real agent quality.

1. Streaming tool-call capture (agent-framework >= 1.7): every function_call
   delta chunk now carries the tool name + a stable call_id, but the code
   assumed only the first chunk had a name. Each argument fragment was
   finalized as its own malformed tool call, turning one
   get_customer_detail({"customer_id":5}) into 6+ garbage calls with
   {"_raw": ...} args (and spamming duplicate tool_called UI events).
   track_function_call_start() now de-duplicates on call_id and only
   starts/broadcasts a genuinely new call. One real call = one clean call.

2. Tool results were never captured. function_result content carries the
   tool output (keyed by call_id) but was discarded. The mixin now records
   it via track_function_result(), so backend tools_used and the evaluator
   can see what each tool returned.

3. task_adherence judge could not verify grounding. metrics.py passed tool
   calls but no tool results, so grounded answers looked fabricated (judge:
   "no corroborating tool interactions"). It now emits role:tool result
   messages. Additionally, azure-ai-evaluation 1.14.0 made
   TaskAdherenceEvaluator a binary "flagged" grader (score in {0,1} plus a
   pass/fail result); the old code treated it as 1-5 and thresholded >= 3,
   so every PASS was recorded as a fail and shown as 1.0/5. Scoring now
   honors the pass/fail result and maps the binary verdict to a 1-5 display.

Verified on a 5-case handoff run: tool calls captured cleanly (2-4 per
query, 0 malformed); task_adherence avg 0.2 -> 3.4, with the remaining low
scores being genuine hallucinations (agent citing data absent from tool
output) rather than measurement artifacts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@james-tn james-tn merged commit 9c91bf7 into main Jun 10, 2026
18 checks passed
@james-tn james-tn deleted the fix/eval-foundry-remote-client branch June 10, 2026 03:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant